Separating Flowers

This notebook explores a classic machine learning dataset: the Iris flower dataset.

Tutorial goals

  1. Explore the dataset
  2. Build a simple predictive model
  3. Iterate and improve your score

How to follow along:

git clone https://github.com/dataweekends/pyladies_intro_to_data_science

cd pyladies_intro_to_data_science

ipython notebook

We start by importing the necessary libraries:


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

1) Explore the dataset

Numerical exploration

  • Load the csv file into memory using Pandas
  • Describe each attribute
    • is it discrete?
    • is it continuous?
    • is it a number?
  • Identify the target
  • Check if any values are missing

Load the csv file into memory using Pandas


In [ ]:
df = pd.read_csv('iris-2-classes.csv')

What's the content of df?


In [ ]:
df.iloc[[0,1,98,99]]

Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)


In [ ]:
df.info()

Quick stats on the features


In [ ]:
df.describe()

Identify the target

What are we trying to predict?

ah, yes... the type of Iris flower!


In [ ]:
df['iris_type'].value_counts()

Check if any values are missing


In [ ]:
df.info()

Mental notes so far:

  • Dataset contains 100 entries
  • 1 Target column (iris_type)
  • 4 Numerical Features
  • No missing values
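
These notes are easy to double-check in code. A quick sanity-check cell (an added sketch, using only the df loaded above) could be:

In [ ]:
# Sanity-check the mental notes against the dataframe itself
print(df.shape)                 # expect (100, 5): 100 rows, 4 features + 1 target
print(df.isnull().sum().sum())  # expect 0: no missing values
print(df.dtypes)                # 4 numeric feature columns plus the text iris_type column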

Visual exploration

  • Distribution of Sepal Length, influence on target (a sketch after the scatter plot below repeats this for all four features):

In [ ]:
df[df['iris_type'] == 'virginica']['sepal_length_cm'].plot(kind='hist', bins=10,
                                                           range=(4, 7), alpha=0.3, color='b')
df[df['iris_type'] == 'versicolor']['sepal_length_cm'].plot(kind='hist', bins=10,
                                                            range=(4, 7), alpha=0.3, color='g')
plt.title('Distribution of Sepal Length', size=20)
plt.xlabel('Sepal Length (cm)', size=20)
plt.ylabel('Number of flowers', size=20)
plt.legend(['Virginica', 'Versicolor'])
plt.show()
  • Two features combined, scatter plot:

In [ ]:
plt.scatter(df[df['iris_type'] == 'virginica']['petal_length_cm'].values,
            df[df['iris_type'] == 'virginica']['sepal_length_cm'].values,
            label='Virginica', c='b', s=40)
plt.scatter(df[df['iris_type'] == 'versicolor']['petal_length_cm'].values,
            df[df['iris_type'] == 'versicolor']['sepal_length_cm'].values,
            label='Versicolor', c='r', marker='s', s=40)
plt.legend(loc=2)
plt.title('Iris Flowers', size=20)
plt.xlabel('Petal Length (cm)', size=20)
plt.ylabel('Sepal Length (cm)', size=20)
plt.show()
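
The same histogram comparison can be repeated for every feature. A minimal sketch (not part of the original notebook, reusing the column names above):

In [ ]:
# Compare the class distributions for each of the four features
features = ['sepal_length_cm', 'sepal_width_cm',
            'petal_length_cm', 'petal_width_cm']
for feature in features:
    df[df['iris_type'] == 'virginica'][feature].plot(kind='hist', bins=10, alpha=0.3, color='b')
    df[df['iris_type'] == 'versicolor'][feature].plot(kind='hist', bins=10, alpha=0.3, color='g')
    plt.title(feature)
    plt.legend(['Virginica', 'Versicolor'])
    plt.show()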

Ok, so the flowers seem to have different characteristics. Let's build a simple model to test that.

2) Build a simple predictive model

Define a new target column called target like this:

  • if iris_type = 'virginica' ===> target = 1
  • otherwise target = 0

In [ ]:
df['target'] = df['iris_type'].map({'virginica': 1, 'versicolor': 0})

print(df[['iris_type', 'target']].head(2))
print()
print(df[['iris_type', 'target']].tail(2))

Define the simplest model as a benchmark

The simplest model is one that predicts 0 for every flower, i.e. all Versicolor.

How good is it?


In [ ]:
df['target'].value_counts()

If I predict every flower is Versicolor, I'm correct 50% of the time.

We need to do better than that.
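
scikit-learn ships a DummyClassifier that encodes exactly this kind of baseline. A minimal sketch (not part of the original notebook; DummyClassifier is standard scikit-learn):

In [ ]:
from sklearn.dummy import DummyClassifier

feature_cols = ['sepal_length_cm', 'sepal_width_cm',
                'petal_length_cm', 'petal_width_cm']

# 'most_frequent' always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(df[feature_cols], df['target'])
print(baseline.score(df[feature_cols], df['target']))  # 0.50 on this balanced dataset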

Define features (X) and target (y) variables


In [ ]:
X = df[['sepal_length_cm', 'sepal_width_cm',
        'petal_length_cm', 'petal_width_cm']]
y = df['target']

Initialize a Decision Tree model


In [ ]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model

Split the features and the target into Train and Test subsets.

The ratio should be 70/30.


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, random_state=0)
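
A quick check (an added sketch, not in the original) that the 70/30 split came out as expected:

In [ ]:
# 100 rows split 70/30 should give 70 train and 30 test samples
print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)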

Train the model


In [ ]:
model.fit(X_train, y_train)

Calculate the model score


In [ ]:
my_score = model.score(X_test, y_test)

print "Classification Score: %0.2f" % my_score

Print the confusion matrix


In [ ]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

print "\n=======confusion matrix=========="
print confusion_matrix(y_test, y_pred)
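
The rows of the matrix are the true classes and the columns the predicted classes. A labelled version (a pandas sketch, not in the original notebook) is easier to read:

In [ ]:
# Label the confusion matrix: rows are true classes, columns are predictions
cm = confusion_matrix(y_test, y_pred)
print(pd.DataFrame(cm,
                   index=['true versicolor', 'true virginica'],
                   columns=['pred versicolor', 'pred virginica']))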

3) Iterate and improve

Start from:

> python iris_starter_script.py

It's a basic pipeline. How can you improve the score?
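
One direction (an added sketch, not from the starter script): a single 70/30 split of 100 rows is noisy, so cross-validation gives a more stable estimate of the score:

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: five different train/test splits, five scores
scores = cross_val_score(model, X, y, cv=5)
print("CV scores: %s" % scores)
print("Mean CV score: %0.2f" % scores.mean())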

Next Steps: try separating 3 classes instead of 2 (iris.csv provided)
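
The same pipeline carries over directly. A minimal sketch for the 3-class case, assuming iris.csv uses the same column names as iris-2-classes.csv and adds a 'setosa' class (both assumptions, not confirmed by the original):

In [ ]:
# Assumes iris.csv has the same columns, with three values in iris_type
df3 = pd.read_csv('iris.csv')
df3['target'] = df3['iris_type'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})

X3 = df3[['sepal_length_cm', 'sepal_width_cm',
          'petal_length_cm', 'petal_width_cm']]
y3 = df3['target']

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3,
                                                        test_size=0.3, random_state=0)
model3 = DecisionTreeClassifier(random_state=0)
model3.fit(X3_train, y3_train)
print("3-class score: %0.2f" % model3.score(X3_test, y3_test))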